Supercomputing on Massively Parallel Bit-Serial Architectures


Consider the idea that supercomputing is a synergy of generic
algorithms, languages and architectures, and that real breakthroughs in
parallel computing will be achieved by considering all three together in a
simulated software environment. Engineering tradeoffs could be made between
performance, machine transparency, standardization and program portability
before any new machines are actually built. Standardized languages could be
developed for generic subclasses of parallel machines: languages that really
give high performance and encourage free parallel expression and "thinking in
parallel".

My own research on the Goodyear MPP (Massively Parallel Processor)
suggests that high-level parallel languages are practical and can be
designed with powerful new semantics that allow algorithms to be efficiently
mapped to the real machines. For the MPP these semantics include
parallel/associative array selection for both dense and sparse matrices,
variable precision arithmetic to trade accuracy for speed, micro-pipelined
"train" broadcast, and conditional branching at the PE control unit level.

The preliminary design of a FORTRAN-like parallel language for the MPP
has been completed and is being used to write programs to perform sparse
matrix array selection, min/max search, matrix multiplication, Gaussian
elimination on single bit arrays and other generic algorithms. The MPP
timing estimate for Gaussian elimination of a 4K by 4K single bit matrix is
under one second, the equivalent of approximately 64 billion scalar
operations. Parallel Gauss-Jordan matrix inversion is also being
investigated. The estimated time to invert a 128 X 128, 32 bit real matrix
using full pivoting on the MPP is 50 msec. This is roughly equivalent to a
100 MFLOP scalar rate.
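
The two equivalences quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is only a rough check: the operation-count formulas (about n^3 bit operations for single bit elimination, about 2n^3 floating point operations for Gauss-Jordan inversion) are assumptions made here, not figures from the text.

```python
# Back-of-the-envelope check of the two timing equivalences quoted above.
# The operation-count formulas are assumptions, not taken from the text.

def gf2_elimination_ops(n: int) -> int:
    """Assumed scalar cost of eliminating an n x n single bit matrix: about n**3."""
    return n ** 3

def inversion_mflops(n: int, seconds: float) -> float:
    """Assumed cost of Gauss-Jordan inversion: about 2*n**3 floating point operations."""
    return 2 * n ** 3 / seconds / 1e6

bit_ops = gf2_elimination_ops(4096)    # 4K by 4K single bit matrix
rate = inversion_mflops(128, 0.050)    # 128 x 128 matrix in 50 msec

print(bit_ops)          # 68719476736, i.e. 64 * 2**30
print(round(rate, 1))   # 83.9
```

Under these assumed counts the first figure is exactly 64 x 2^30 and the second lands a little below 100 MFLOPS, which is consistent with "roughly equivalent".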

The MPP is a SIMD machine of 16384 single bit processors arranged in a
128 X 128 array. Individual PE's are interconnected with their four nearest
neighbors. Each PE can address 1024 bits of its own local memory. A 32 bit
shift register in each PE allows for micro-pipelining of long words and
faster partial sum accumulation for multiplication. The machine can execute
160 billion micro-instructions per second, which translates to 800 GOPS for
some instructions. Operations include single bit logical, shift, and add as
well as column I/O and one or two dimensional routing in a spiral,
cylinder, or torus. All operations can be directly or indirectly masked.
The logical "or" of one bit per PE (SUMOR) can be used to pass array
information back to the PE control unit for broadcast to other PE's, scalar
I/O or conditional branching. If a second MPP were ever built, it might
look considerably different from the current MPP. For example, it would
certainly have greater memory depth, at least 64K bits per PE. It might
also have a reconfigurable bit/byte serial ALU, staged PE's for table lookup
arithmetic, and pipelined SUMOR logic.
SUPERCOMPUTING ON MASSIVELY PARALLEL
BIT-SERIAL ARCHITECTURES


• SUPERCOMPUTING DOMAIN 

• NEW DIMENSIONS IN PARALLEL COMPUTING

• SOME GENERIC ALGORITHMS 

• THE GOODYEAR MPP 

• SOME MPP SPECIFIC ALGORITHMS CODED IN A FORTRAN-LIKE 
BIT-SERIAL PROGRAMMING LANGUAGE 

• WHAT MIGHT A SECOND GENERATION MPP LOOK LIKE? 


SUPERCOMPUTING DOMAIN


[Figure: the supercomputing domain; surviving labels are PARALLEL PROGRAMS and SIMULATED SOFTWARE ENVIRONMENT]
MPP PERFORMANCE WITH INTEGER OPERANDS


[Figure: performance plot; axis label MILLIONS OF OPERATIONS PER SECOND]
DIVISION BY $2^N \mp 1$ EXAMPLE


FROM THE BINOMIAL THEOREM,

$$\frac{1}{1 \mp x} = 1 \pm x + x^2 \pm x^3 + \cdots \qquad (x^2 < 1)$$

BY A CHANGE OF VARIABLE $x = 1/\beta$ THEN

$$\frac{1}{\beta \mp 1} = \frac{1}{\beta} \pm \frac{1}{\beta^2} + \frac{1}{\beta^3} \pm \cdots \qquad (\beta^2 > 1)$$

NOW LET $\beta = 2^N$ AND DIVISION BY $2^N \mp 1$
REDUCES TO A SHORT SEQUENCE OF BINARY
SHIFTS AND ADDS (AND/OR SUBTRACTS),

$$\frac{V}{2^N \mp 1} = \frac{V}{2^N} \pm \frac{V}{2^{2N}} + \frac{V}{2^{3N}} \pm \cdots$$

FOR EXAMPLE, LET V = 237658 AND N = 10
THEN

$$\frac{V}{1023} \approx 232.315$$

AFTER 3 SHIFTS AND 2 ADDS
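
The worked example can be reproduced with integer shifts and adds. The following is a minimal sketch for the 2^N - 1 case only; the function name, the fixed-point format and the choice of 30 fractional bits are illustrative assumptions, not part of the slide.

```python
def divide_by_pow2_minus_1(v: int, n: int, terms: int = 3, frac_bits: int = 30) -> float:
    """Approximate v / (2**n - 1) by summing `terms` right-shifted copies of v.

    The value is held in fixed point with `frac_bits` fractional bits
    (an illustrative choice) so that shifted-out bits are not lost.
    """
    scaled = v << frac_bits              # fixed-point representation of v
    acc = 0
    for k in range(1, terms + 1):
        acc += scaled >> (n * k)         # the term v / 2**(k*n)
    return acc / (1 << frac_bits)


approx = divide_by_pow2_minus_1(237658, 10)
print(round(approx, 3))                  # 232.315
```

With three terms the result already agrees with 237658 / 1023 to better than one part in a million; for a divisor of 2^N + 1 the same loop would alternate adds and subtracts.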
THE GOODYEAR MPP


• SIMD MACHINE OF 16384 SINGLE BIT PROCESSORS ARRANGED IN A
128 X 128 ARRAY

• NEAREST NEIGHBOR INTERCONNECTIVITY

• 1024 BITS OF MEMORY PER PE

• 32 BIT SHIFT REGISTER ALLOWS FOR MICRO-PIPELINING AND
FASTER MULTIPLICATION

• EXECUTION SPEED OF 160 BILLION MICRO-INSTRUCTIONS PER SECOND
WHICH TRANSLATES TO 800 GOPS FOR SOME INSTRUCTIONS

• OPERATIONS INCLUDE SINGLE BIT LOGICAL, SHIFT, AND ADD AS
WELL AS COLUMN I/O AND ONE OR TWO DIMENSIONAL ROUTING IN
A SPIRAL, CYLINDER, OR TORUS

• ALL OPERATIONS CAN BE DIRECTLY OR INDIRECTLY MASKED

• THE LOGICAL "OR" OF ONE BIT PER PE (SUMOR) CAN BE USED TO
PASS ARRAY INFORMATION BACK TO THE PE CONTROL UNIT FOR
BROADCAST, SCALAR I/O, OR CONDITIONAL BRANCHING
ONE OF 16384 MPP PROCESSING ELEMENTS (PE'S)
PARALLEL/ASSOCIATIVE ARRAY SELECTION


• PARALLEL



S=SUMOR(A[64,256])
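
SUMOR is described in this talk as the logical "or" of one bit per PE. A small sketch of that reduction, with an optional selection mask, might look as follows; the `BitPlane` type and the `mask` argument are hypothetical names introduced only for illustration.

```python
from typing import List, Optional

BitPlane = List[List[int]]   # one bit per PE, indexed [row][column]

def sumor(plane: BitPlane, mask: Optional[BitPlane] = None) -> int:
    """Return the logical OR of one bit per PE, restricted to PEs where mask is 1."""
    for r, row in enumerate(plane):
        for c, bit in enumerate(row):
            if bit and (mask is None or mask[r][c]):
                return 1
    return 0

plane = [[0, 0, 0], [0, 1, 0]]
only_first_row = [[1, 1, 1], [0, 0, 0]]
print(sumor(plane))                   # 1
print(sumor(plane, only_first_row))   # 0
```

The scalar result is what the control unit would test for a conditional branch.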
MAXIMUM OF 32 BIT INTEGER ARRAY
(OF UNIQUE VALUES)


BIT MAX[ ]                             ; Declare MAX as bit mask
                                         over all PE's

INTEGER A[128,128](0:32)               ; Declare A as a 128 X 128
                                         unsigned integer array

MAX=1                                  ; Initialize MAX to 1 over
                                         all PE's

DO 1 I=1,32                            ; Scan bits in A from most
                                         to least significant bits

IF (SUMOR(A[MAX](I))) MAX=A[MAX](I)    ; Replace MAX with a new
                                         subset of maximum values
                                         for each non zero bit
                                         plane of A
1 CONTINUE
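
The bit-plane scan above can be imitated sequentially. This is a minimal sketch in which a Python list stands in for the PE array and `any()` stands in for SUMOR; it is an illustration of the idea, not the MPP language semantics.

```python
from typing import List

def bit_serial_max_mask(values: List[int], bits: int = 32) -> List[int]:
    """Return a 0/1 mask marking the elements that hold the maximum value."""
    mask = [1] * len(values)                       # MAX=1 over all PEs
    for b in range(bits - 1, -1, -1):              # most to least significant bit
        plane = [m & (v >> b) & 1 for m, v in zip(mask, values)]
        if any(plane):                             # SUMOR of the selected bit plane
            mask = plane                           # keep only candidates with a 1 here
    return mask

values = [12, 7, 200, 45, 199]
print(bit_serial_max_mask(values))                 # [0, 0, 1, 0, 0]
```

The loop runs once per bit of precision regardless of how many elements there are, which is why the search time on the array depends on word length rather than array size.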


MAXIMUM OF 32 BIT INTEGER ARRAY
(GENERAL CASE)


BIT MAX[ ],T[ ](46),INDEX[ ](14)

INTEGER A[128,128](0:32)

COMMON /INIT/ INDEX                    ; Same algorithm as before
                                         except A array is first
                                         concatenated with the
                                         PE address field to ensure
                                         uniqueness of result

MAX=1

T=A.CON.INDEX

DO 1 I=1,46

IF (SUMOR(T[MAX](I))) MAX=T[MAX](I)
1 CONTINUE
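
The general case differs only in that each value is first concatenated with its PE address, giving 32 + 14 = 46 bit keys that are all distinct. A sketch under the same sequential stand-in assumptions (a list for the array, `any()` for SUMOR, hypothetical function name):

```python
from typing import List

def bit_serial_argmax(values: List[int], value_bits: int = 32, index_bits: int = 14) -> int:
    """Return the index of one maximal element, breaking ties by highest index."""
    # Concatenate each value with its PE address so that all keys are distinct.
    keys = [(v << index_bits) | i for i, v in enumerate(values)]
    mask = [1] * len(keys)
    for b in range(value_bits + index_bits - 1, -1, -1):
        plane = [m & (k >> b) & 1 for m, k in zip(mask, keys)]
        if any(plane):
            mask = plane
    return mask.index(1)       # exactly one PE remains selected

print(bit_serial_argmax([5, 9, 3, 9]))   # 3
```

Because the address occupies the low-order bits, duplicated maxima are resolved in favor of the highest address in this sketch; it assumes a non-empty list with at most 2^14 elements.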




MATRIX MULTIPLICATION EXAMPLE 




REAL A[8,16,128](8:32),B[8,16,128](8:32)
     C[8,16,128](8:32),T[8,16,128](8:32)

READ A[ , ,1],B[1, , ]

T=A[ ]*B[1...]
C=T[ ,+, ]

PRINT C[ ,1, ]
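
One reading of this listing is that the 8 X 16 X 128 arrays map onto the 16384 PEs, the operands are replicated along the missing axis, all products are formed at once, and `C=T[ ,+, ]` sums over the middle index. That interpretation is an assumption; the sketch below shows the same product-then-reduce structure on small nested lists.

```python
from typing import List

Matrix = List[List[float]]

def broadcast_multiply(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (I x K) by b (K x J) via a 3-D product array and a sum over K."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # t[i][k][j] = a[i][k] * b[k][j]: a replicated along j, b replicated along i.
    t = [[[a[i][k] * b[k][j] for j in range(cols)]
          for k in range(inner)]
         for i in range(rows)]
    # Reduce over the middle index, as in C=T[ ,+, ].
    return [[sum(t[i][k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(broadcast_multiply(a, b))   # [[19, 22], [43, 50]]
```

On a parallel array every entry of `t` would be computed in the same step, leaving only the reduction over the middle index as sequential work.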


COLUMN BROADCAST EXAMPLE 



REAL A[128,128](8:32)
A=A[1...]


OR 


REAL A[128,128](8:32)
BIT M[ ]

M=[128,128; ,1]
A=A[ .NOT. Mil ,128 -H 
COLUMN BROADCAST EXAMPLE


PROBLEM:      TO BROADCAST A COLUMN OF FLOATING POINT NUMBERS
              ACROSS THE MPP ARRAY

SOLUTION #1:  WITH PE'S INTERCONNECTED IN AN E/W CYLINDER,
              LOAD, SHIFT AND STORE THE 32 BIT VALUES
              ACROSS THE ARRAY. THIS TAKES APPROXIMATELY
              3 X 32 X 128 = 12288 CYCLES.

SOLUTION #2:  WITH PE'S INTERCONNECTED IN AN E/W CYLINDER,
              "TRAIN" BROADCAST THE 32 BIT VALUES ACROSS
              THE ARRAY. THIS CAN BE VIEWED AS A
              MICRO-PIPELINING OPERATION AND TAKES ONLY 207 CYCLES.

              THE ALGORITHM IS AS FOLLOWS:

              • GET "TRAIN" OF 1 STOP BIT + 32 BIT VALUES
                OUT ONTO THE E/W PE CHANNEL (≈ 33 CYCLES)

              • CIRCULATE "TRAIN" ONCE AROUND (≈ 128 CYCLES).
                DURING THIS PROCESS INDIVIDUAL PE'S WILL
                STORE THE "TRAIN" IN THEIR SHIFT REGISTERS.
                SHIFTING STOPS WHEN THE STOP BIT ENTERS THE
                CONDITIONAL MASK REGISTER OF EACH PE.

              • STORE ALL SHIFT REGISTERS (≈ 32 CYCLES).
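
The two cycle counts can be compared with a simple cost model. The sketch below only adds up the phases itemized on the slide; the function and constant names are hypothetical. Note that the three itemized phases of the "train" method sum to 193 cycles, while the slide quotes 207, so the quoted total evidently includes cycles that the slide does not itemize.

```python
WORD_BITS = 32     # bits per floating point value
ARRAY_WIDTH = 128  # PEs per row of the array

def load_shift_store_cycles(bits: int = WORD_BITS, width: int = ARRAY_WIDTH) -> int:
    """Solution #1: three operations (load, shift, store) per bit per column step."""
    return 3 * bits * width

def train_broadcast_cycles(bits: int = WORD_BITS, width: int = ARRAY_WIDTH) -> int:
    """Solution #2: sum of the three itemized phases only."""
    launch = 1 + bits        # stop bit plus data bits onto the E/W channel
    circulate = width        # one trip around the cylinder
    store = bits             # store the shift registers back to memory
    return launch + circulate + store

print(load_shift_store_cycles())   # 12288
print(train_broadcast_cycles())    # 193
```

Either way the pipelined method is roughly 60 times cheaper, because the 32 bits travel together instead of making 32 separate trips across the array.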











GAUSSIAN ELIMINATION EXAMPLE


BIT A[4000,4](1000),M[4000,4],USED(4000)
INTEGER PIVOT(4000,0:14),J1(0:2),J2(0:12),J(0:14)
EQUIVALENCE (J1,J(1)),(J2,J(3))

READ A                                  ; Read in array

DO 1 I=1,4000                           ; Initialize history matrix
USED(I)=0
1 CONTINUE

DO 7 I=1,4000
DO 2 J2=1,1000                          ; Search for a 1 in row I
                                          in steps of 4 columns
IF (SUMOR(A[I, ](J2))) GO TO 3
2 CONTINUE
GO TO 8                                 ; Row of all 0's - exit

3 CONTINUE
DO 4 J1=1,4                             ; Find which column of 4
IF (SUMOR(A[I,J1](J2))) GO TO 5
4 CONTINUE

5 CONTINUE
PIVOT(I)=J                              ; Save history information
USED(J)=1

M=A[ ](J2).AND..NOT.[4000,4;I,J1]       ; Save pivot column in new
                                          matrix M, zeroing the pivot
                                          row value

DO 6 J2=1,1000                          ; Eliminate 4 columns at a time
                                          by broadcasting the pivot
                                          column across the M array
A[ ](J2)=A[ ](J2).XOR.M[J1...]
6 CONTINUE

7 CONTINUE

8 CONTINUE
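
For comparison, elimination on a single bit matrix can be sketched sequentially with each row packed into an integer, so that one XOR updates a whole row. This is a generic GF(2) elimination written for illustration, not a transcription of the MPP listing: it pivots on the lowest set bit of each row and, unlike the listing, skips an all-zero row instead of exiting.

```python
from typing import List, Tuple

def gf2_eliminate(rows: List[int]) -> Tuple[List[int], List[int]]:
    """Reduce a single bit matrix over GF(2).

    Each row is an int whose bit j holds column j. For every row in turn,
    pick a pivot column containing a 1 and XOR that row into every other row
    that has a 1 in the pivot column. Returns the reduced rows and the pivot
    column chosen for each row (-1 for an all-zero row).
    """
    rows = list(rows)
    pivots = []
    for i in range(len(rows)):
        if rows[i] == 0:                 # row of all 0's: no pivot here
            pivots.append(-1)
            continue
        col = (rows[i] & -rows[i]).bit_length() - 1   # lowest set bit = pivot column
        pivots.append(col)
        for r in range(len(rows)):
            if r != i and (rows[r] >> col) & 1:
                rows[r] ^= rows[i]       # eliminate with a single XOR
    return rows, pivots

reduced, pivots = gf2_eliminate([0b011, 0b110, 0b101])
print(reduced, pivots)                   # [5, 6, 0] [0, 1, -1]
```

The third row reduces to zero because it is the XOR of the first two, so the example matrix has rank 2.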
GAUSS-JORDAN MATRIX INVERSION


WITH FULL PIVOTING
PARALLEL DATA STRUCTURES


REAL ARRAYS

U = [ A : I ]    AUGMENTED MATRIX

V = [   :   ]    WORKING ARRAY

W = [   :   ]    WORKING ARRAY

BIT MASKS

X = [ 1 : 0 ]    PIVOTED ROW/COLUMNS

Y = [ 1 : 1 ]    PIVOT ROW

WHERE I IS THE IDENTITY MATRIX
      1 IS THE UNITY MATRIX
      0 IS THE ZERO MATRIX
OTHER DATA STRUCTURES


SCALARS


DET = 1


PIVOT
PARALLEL APPROACH TO MATRIX INVERSION


REPEAT FOLLOWING STEPS N TIMES 

• FIND NEXT PIVOT 

• UPDATE DETERMINANT (OPTIONAL)

• ZERO PIVOT ROW AND COLUMN IN X 

• ZERO PIVOT ROW IN Y 

• NORMALIZE PIVOT ROW IN U 

• BROADCAST PIVOT ROW N TIMES INTO V 

• BROADCAST PIVOT COLUMN 2N TIMES INTO W 

• PERFORM PARALLEL ROW OPERATIONS FOR A 
SINGLE PIVOT 

• RESET PIVOT ROW IN Y 


THEN REORDER ROWS IN U TO FORM
U = [ I : A^-1 ]
PARALLEL MATRIX INVERSION ALGORITHM


FOR I = 1 TO N

    PIVOT = MAX|U| PER X
    DET = DET * PIVOT

    X = 0 IN PIVOT ROW AND COLUMN

    Y = 0 IN PIVOT ROW

    U = U / PIVOT IN PIVOT ROW

    V = PIVOT ROW OF U, BROADCAST

    W = PIVOT COLUMN OF U, BROADCAST

    U = U - V * W PER Y

    Y = 1 IN PIVOT ROW

END I

FOR J = 1 TO N
    FOR I = 1 TO N

        IF U[I,J] = 1 THEN V[J,*] = U[I,*]
    END I
END J
U = V
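
The steps above can be followed in a short sequential sketch. It is written under several assumptions made for illustration: exact rational arithmetic (`fractions.Fraction`) replaces the 32 bit reals so that the test `U[I,J] = 1` is exact, nested lists replace the PE array, and the masks are ordinary Python booleans. The running product `det` equals the determinant only up to the sign of the row/column permutation introduced by full pivoting.

```python
from fractions import Fraction
from typing import List, Tuple

def invert_full_pivoting(a: List[List[int]]) -> Tuple[List[List[Fraction]], Fraction]:
    """Gauss-Jordan inversion with full pivoting on the augmented matrix U = [A : I]."""
    n = len(a)
    u = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    x = [[True] * n for _ in range(n)]      # positions still eligible as pivots
    det = Fraction(1)
    for _ in range(n):
        # PIVOT = MAX|U| PER X
        r, c = max(((i, j) for i in range(n) for j in range(n) if x[i][j]),
                   key=lambda p: abs(u[p[0]][p[1]]))
        pivot = u[r][c]
        if pivot == 0:
            raise ValueError("matrix is singular")
        det *= pivot
        for k in range(n):                  # zero pivot row and column in X
            x[r][k] = x[k][c] = False
        u[r] = [v / pivot for v in u[r]]    # normalize pivot row in U
        w = [u[i][c] for i in range(n)]     # pivot column, one value per row
        for i in range(n):                  # U = U - V * W on every row but the pivot row
            if i != r:
                u[i] = [ui - w[i] * vr for ui, vr in zip(u[i], u[r])]
    inv = [None] * n
    for j in range(n):                      # reorder rows so the left half becomes I
        for i in range(n):
            if u[i][j] == 1:
                inv[j] = u[i][n:]
    return inv, det

inv, det = invert_full_pivoting([[2, 1], [5, 3]])
print([[int(v) for v in row] for row in inv])   # [[3, -1], [-5, 2]]
```

On the array, the pivot search, the normalization and the rank-one update would each be a single masked parallel operation, so only the outer loop over N pivots remains sequential.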
MPP II:

WHAT MIGHT IT LOOK LIKE?


• MUCH GREATER MEMORY DEPTH: AT LEAST 64K BITS
PER PE, WITH AT LEAST ONE LEVEL OF INDIRECT
ADDRESSING.

• RECONFIGURABLE BIT/NIBBLE/BYTE SERIAL ALU

• STAGED PE'S FOR TABLE LOOKUP ARITHMETIC.

HOW MANY TABLES? WHAT SIZE? RAM OR ROM?

• PIPELINED SUMOR LOGIC